Abstract

Constrained clustering has been well studied in the unsupervised learning
community. However, how to encode constraints into the community detection
process for complex social networks remains a challenging problem. In this paper,
we propose a semi-supervised learning framework for community structure 
detection. This framework implicitly encodes the must-link and cannot-link 
constraints by modifying the adjacency matrix of the network, which can also 
be regarded as the de-noising process of the consensus matrix of the com- 
munity structures. Our proposed method gives consideration to both the 
topology and the functions (background information) of the complex net- 
work, which improves the interpretability of the results. The comparisons 
performed on both the synthetic benchmarks and the real-world networks 
show that the proposed framework can significantly improve the detection 
performance with few constraints, which makes it an attractive methodology 
in the analysis of complex social networks. 

Keywords: de-noising, nonnegative matrix factorization, spectral
clustering, semi-supervised learning, community structure detection 



1. Introduction 

Evidence has shown that there are often modules or community structures
in complex social networks ([1]). For example, a community structure
could be a set of proteins that have similar functions in a protein-protein
interaction (PPI) network, a group of fans who visit similar kinds of music
web pages, or a club in a university. Though there is still no standard and
clear definition of community structure, we may regard a community
structure in a social network as a set of nodes that have a similar link
pattern, or in other words, nodes that have similar preferences and connect
to the other nodes in a similar way ([2]). The most common and widely
studied community structure is the subgraph that is densely interconnected 
but loosely connected with the rest of the graph. Meanwhile, there are also 
other types of community structures. Discovering the community structures 
in complex social networks is very important for revealing the organization 
and the functions of the network, such as understanding how the units in 
some systems communicate with each other and work together, or learning 
how the new ideas or diseases spread in a group of persons (0]), etc. 

How to detect community structures has thus become a hot research
topic, and many interesting models and algorithms have been developed with
good results. However, all of these methods are in essence unsupervised:
they make use of the network topology only. In many real applications, we
also have background information that could be useful in detecting the
structures. How to combine this information with the topology to guide the
detection process is an interesting problem that is worth working on.

In this paper, we give a semi-supervised framework to incorporate prior in- 
formation into community structure detection. Under the proposed method- 
ology, one can easily provide pairwise constraints on a few nodes in the 
network, specifying whether they must or cannot be in the same community 
structure, based on the background information and domain knowledge. For 
example, the nodes that have similar functions should be must-link, or the 
nodes that have different opinions should be cannot-link. The framework 
implicitly encodes the must-link and cannot-link constraints by modifying 
the adjacency matrix of the network, which can also be regarded as the de- 
noising process of the consensus matrix of the community structures. The 
experimental results show the effectiveness of the proposed work. 

The rest of the paper is organized as follows: Sect. 2 discusses how to
incorporate the prior information to guide the community structure detection
process, and briefly reviews nonnegative matrix factorization (NMF) and
spectral clustering, which are applied for the detection; Sect. 3 gives the
experimental results based on simulations and real-world datasets; and finally
Sect. 4 gives the conclusion and the prospect of future work.

2. Semi-supervised learning for community structure detection

In this section, we give our semi-supervised framework for community 
structure detection. Firstly, we introduce the definition of the adjacency 
matrix of an undirected and unweighted simple graph G with n nodes: 

$$A^{[0]}_{ij} = \begin{cases} 1, & \text{if } i \sim j \\ 0, & \text{if } i = j \text{ or } i \nsim j, \end{cases}$$

where $i \sim j$ means there is an edge between nodes i and j, and
$i \nsim j$ means there is no edge between them. Here $A^{[0]}$ is n x n
and symmetric.

Note that the diagonal elements of $A^{[0]}$ are all zeros, but these zeros
are obviously different from the ones at the off-diagonal positions, which
mean there are no connections between the nodes. Hence we here set the
diagonal elements of $A^{[0]}$ to 1. The revised adjacency matrix is denoted
by $A^{[1]}$.
Another variation of A^ is its complementary matrix = 1 — A^. 

2.1. Incorporating prior knowledge into adjacency matrix 

In many real applications, we often have some background information 
that can be used for community structure detection. Specifically, we consider 
the following two types of pairwise constraints: 

• Must-Link constraints $C_{ML}$: $(i,j) \in C_{ML}$ means that the two
nodes i and j must belong to the same community structure,

• Cannot-Link constraints $C_{CL}$: $(i,j) \in C_{CL}$ means that the two
nodes i and j cannot belong to the same community structure.

We incorporate the constraints $C_{ML}$ and $C_{CL}$ into the adjacency
matrix (in either of its variants, written $A$ below) to get a new matrix B
as follows:

$$B_{ij} = \begin{cases} \alpha, & \text{if } (i,j) \in C_{ML} \\ 0, & \text{if } (i,j) \in C_{CL} \\ A_{ij}, & \text{otherwise,} \end{cases} \qquad (1)$$

where $\alpha$ is some positive constant.
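
The construction of B can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `apply_constraints`, the 0-based node indices and the use of NumPy are assumptions made here, and constraints are applied symmetrically because the graph is undirected.

```python
import numpy as np

def apply_constraints(A, must_link, cannot_link, alpha=1.0):
    """Return B as in Eq. (1): alpha on must-link pairs, 0 on cannot-link pairs."""
    B = np.array(A, dtype=float)      # copy, so that A itself is left unchanged
    for i, j in must_link:
        B[i, j] = B[j, i] = alpha     # keep B symmetric
    for i, j in cannot_link:
        B[i, j] = B[j, i] = 0.0
    return B

# Toy path graph 0-1-2-3 with the diagonal set to 1 (hypothetical data).
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]])
B = apply_constraints(A, must_link=[(0, 1)], cannot_link=[(1, 2)], alpha=2.0)
```

Entries that are not covered by any constraint keep their original value, so with no constraints B equals the input matrix.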

As one can see, if we set $\alpha$ to 1 and know, for every pair of nodes,
whether it belongs to $C_{ML}$ or $C_{CL}$ (in other words, we know the
community structures in the graph exactly), the adjacency matrix reduces
to the standard consensus matrix, whose (i,j)th element indicates whether
node i and node j are in the same community structure: 1 means yes and 0
means no. Hence, from the point of view of the consensus matrix,
incorporating prior knowledge can be regarded as a de-noising process.

We denote formula (1) by type I if $\alpha = 1$, and by type II if $\alpha = 2$.

After incorporating background information into the adjacency matrix,
we then apply nonnegative matrix factorization (NMF) and spectral
clustering, which are among the most common and widely used models in
unsupervised learning, for community structure detection.

2.2. Nonnegative matrix factorization (NMF)

NMF can be expressed as follows: given a nonnegative objective matrix
X of size n x m, whose columns are samples and whose rows are features, we
try to find two nonnegative matrices F of size n x k and G of size m x k
such that $X \approx FG^T$. This problem is often formulated as the
following nonlinear programming problem:

$$\min J(X \| FG^T) \quad \text{s.t. } F \ge 0,\ G \ge 0, \qquad (2)$$

where $J(X \| FG^T)$ is some cost function that measures the dissimilarity
between X and $FG^T$, and $F \ge 0$, $G \ge 0$ mean that F and G should not
have negative entries. The most popular algorithms designed for NMF are
multiplicative update rules. The objective matrix X for NMF can be selected
as B.

In ref. M], it has been shown that the diffusion- kernel-based similarity 
matrix SK (see footnote 1) was the best choice for the objective matrix X
among all the candidates, hence we also tested SK in this paper (see
subsection 3.3).



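1 Definition of the diffusion kernel K and the similarity matrix SK:
$K = \lim_{n \to \infty} (1 + \beta L / n)^n = \mathrm{expm}(\beta L)$, where
L is the opposite Laplacian of the adjacency matrix:

$$L_{ij} = \begin{cases} 1, & \text{if } i \sim j \\ -d_i, & \text{if } i = j \\ 0, & \text{otherwise,} \end{cases}$$

with $d_i$ the degree of node i.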

And 



We set $\beta = 0.2$ in this paper. Note that there is a MATLAB command "expm"
for the exponential of a matrix.
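
For readers working outside MATLAB, the kernel $K = \mathrm{expm}(\beta L)$ can also be obtained from an eigendecomposition, since L is symmetric. The sketch below is only an illustration: the function name `diffusion_kernel` and the use of NumPy are assumptions, and it computes K only, not the similarity matrix SK.

```python
import numpy as np

def diffusion_kernel(A, beta=0.2):
    """K = expm(beta * L), with L_ij = 1 if i~j, -d_i if i = j, 0 otherwise."""
    A = np.array(A, dtype=float)
    np.fill_diagonal(A, 0.0)          # ignore self-loops when building L
    L = A - np.diag(A.sum(axis=1))    # off-diagonal: adjacency; diagonal: -degree
    w, V = np.linalg.eigh(L)          # symmetric L, so L = V diag(w) V^T
    return (V * np.exp(beta * w)) @ V.T

# Two nodes joined by one edge (hypothetical toy input).
K = diffusion_kernel([[0, 1], [1, 0]], beta=0.2)
```

Because every row of L sums to zero, every row of K sums to one, which is a quick sanity check on the result.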
The community structures of the network can be obtained from G: node
i belongs to community k if $G_{ik}$ is the largest element in the ith row of G.

1) Standard NMF with least squares error: If $J(X \| FG^T)$ is selected as the
least squares error, $J(X \| FG^T) = \|X - FG^T\|_F^2$, the algorithm of
multiplicative update rules can be summarized in Algorithm 1.

Algorithm 1 Nonnegative Matrix Factorization (Least Squares Error)
Input: X, iter % In this paper, the iteration number iter is set to 100.
Output: F, G.
1: for t = 1 : iter do
2:   $F_{ik} := F_{ik} \frac{(XG)_{ik}}{(FG^TG)_{ik}}$
3:   $G_{ik} := G_{ik} \frac{(X^TF)_{ik}}{(GF^TF)_{ik}}$
4: end for
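
The multiplicative updates for the least squares error translate almost line by line into NumPy. The following is a sketch under stated assumptions, not the authors' code: the random positive initialisation, the seed, and the small constant `eps` that guards against division by zero are choices made here; the helper `communities_from_G` implements the row-wise argmax rule described above.

```python
import numpy as np

def nmf_lse(X, k, n_iter=100, seed=0, eps=1e-12):
    """Multiplicative updates for min ||X - F G^T||_F^2 subject to F, G >= 0."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    F = rng.random((n, k)) + eps      # random positive start (an assumption)
    G = rng.random((m, k)) + eps
    for _ in range(n_iter):
        F *= (X @ G) / (F @ (G.T @ G) + eps)      # F_ik update
        G *= (X.T @ F) / (G @ (F.T @ F) + eps)    # G_ik update, using the new F
    return F, G

def communities_from_G(G):
    """Node i joins the community with the largest entry in row i of G."""
    return G.argmax(axis=1)
```

A typical call would be `F, G = nmf_lse(B, k)` followed by `communities_from_G(G)`; since the updates only multiply by nonnegative ratios, F and G stay nonnegative throughout.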



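2) Standard NMF with K-L divergence: If $J(X \| FG^T)$ is selected as the
K-L divergence, $J(X \| FG^T) = \sum_{i,j} \left[ X_{ij} \log \frac{X_{ij}}{(FG^T)_{ij}} - X_{ij} + (FG^T)_{ij} \right]$,
the corresponding update rules of F and G are:

$$F_{ik} := F_{ik} \frac{\sum_j X_{ij} G_{jk} / (FG^T)_{ij}}{\sum_j G_{jk}}, \qquad G_{jk} := G_{jk} \frac{\sum_i X_{ij} F_{ik} / (FG^T)_{ij}}{\sum_i F_{ik}}.$$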
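
As a companion sketch, the widely used multiplicative updates for the K-L divergence (each factor is rescaled by a weighted average of the ratios $X_{ij} / (FG^T)_{ij}$) can be written as follows. The function names, the random initialisation and the constant `eps` are assumptions made for illustration, and the update form is the standard one from the NMF literature rather than something confirmed line by line against the authors' code.

```python
import numpy as np

def kl_divergence(X, Y, eps=1e-12):
    """Generalised K-L divergence sum(X log(X/Y) - X + Y), taking 0 log 0 = 0."""
    mask = X > 0
    return float(np.sum(X[mask] * np.log(X[mask] / (Y[mask] + eps)))
                 - X.sum() + Y.sum())

def nmf_kl(X, k, n_iter=100, seed=0, eps=1e-12):
    """Multiplicative updates for min KL(X || F G^T) subject to F, G >= 0."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    F = rng.random((n, k)) + eps
    G = rng.random((m, k)) + eps
    for _ in range(n_iter):
        R = X / (F @ G.T + eps)                   # ratios X_ij / (F G^T)_ij
        F *= (R @ G) / (G.sum(axis=0) + eps)
        R = X / (F @ G.T + eps)                   # recompute with the new F
        G *= (R.T @ F) / (F.sum(axis=0) + eps)
    return F, G
```

Comparing `kl_divergence(X, F @ G.T)` after one iteration and after many is a simple way to check that the objective is going down.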

3) Symmetric NMF (SNMF): There is a variant of NMF for semi-supervised
clustering, whose objective function can be formulated as $\|X - GSG^T\|_F^2$.
The update rules of G and S are:

$$G_{ik} := G_{ik} \frac{(XGS)_{ik}}{(GSG^TGS)_{ik}}, \qquad S_{ik} := S_{ik} \frac{(G^TXG)_{ik}}{(G^TGSG^TG)_{ik}}.$$

2.3. Spectral Clustering

Spectral clustering is powerful in its simplicity and effectiveness, and
can be summarized in Algorithm 2. Note that there are many variations of
the standard one, and the detailed analysis can be found in ref.




Algorithm 2 Spectral Clustering
Input: $B \in \mathbb{R}^{n \times n}$
Output: Community label $Y \in \mathbb{R}^{n \times 1}$ of the n nodes
1: $L = D^{-1/2} B D^{-1/2}$, where D is the diagonal matrix with the
   element $D_{ii} = \sum_j B_{ij}$.
2: Form the matrix $X = [x_1, x_2, \cdots, x_k] \in \mathbb{R}^{n \times k}$,
   where $x_i$, $i = 1, 2, \cdots, k$, are the top k eigenvectors of L.
3: Normalize X so that rows of X have the same $L_2$ norm:
   $X_{ij} = X_{ij} / (\sum_j X_{ij}^2)^{1/2}$.
4: Cluster rows of X into k clusters by K-means.
5: $Y_i = j$ if the ith row is assigned to cluster j.
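
The steps of Algorithm 2 can be sketched with NumPy as below. Several details are assumptions made for this illustration only: the tiny K-means routine with farthest-first initialisation stands in for whatever K-means implementation was actually used, every node is assumed to have a positive row sum in B, and the function names are hypothetical.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd iterations with farthest-first initialisation (a stand-in)."""
    rng = np.random.default_rng(seed)
    centers = [points[rng.integers(len(points))]]
    for _ in range(1, k):
        d = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])      # farthest point so far
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def spectral_clustering(B, k, seed=0):
    B = np.asarray(B, dtype=float)
    d_inv_sqrt = 1.0 / np.sqrt(B.sum(axis=1))            # assumes positive row sums
    L = d_inv_sqrt[:, None] * B * d_inv_sqrt[None, :]    # step 1: D^{-1/2} B D^{-1/2}
    _, vecs = np.linalg.eigh(L)                          # eigenvalues in ascending order
    X = vecs[:, -k:]                                     # step 2: top k eigenvectors
    X = X / np.linalg.norm(X, axis=1, keepdims=True)     # step 3: unit-length rows
    return kmeans(X, k, seed=seed)                       # steps 4 and 5

# Two 4-node cliques (diagonal set to 1) joined by a single edge.
B = np.kron(np.eye(2), np.ones((4, 4)))
B[3, 4] = B[4, 3] = 1.0
labels = spectral_clustering(B, k=2)
```

The cluster ids returned by K-means are arbitrary, so only the grouping of nodes is meaningful, not the label values themselves.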



2.4. An illustrative example

We close this section with an illustrative example: we try to detect the
community structures in an undirected and unweighted network with 128
nodes, so there are 128 x (128 - 1)/2 = 8128 pairs of nodes. The network
has 4 communities with 32 nodes each. The heatmap of the corresponding
adjacency matrix is shown as the leftmost panel in Fig. 1. If we have prior
knowledge about the network structure so that we can determine a percentage
of pairs of nodes as must-link or cannot-link, we can incorporate them into
the adjacency matrix. As one can see in Fig. 1, the adjacency matrix becomes
clearer and clearer as the percentage of pairs constrained increases, and
finally reduces to the standard consensus matrix of the community structures.
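
To make the notion of "a percentage of pairs constrained" concrete, the sketch below draws a fraction of all node pairs at random and labels each one from known community labels, following the rule used in the experiments of Sect. 3.4 (same label: must-link; different labels: cannot-link). The function name `sample_constraints`, the seed and the rounding of the number of pairs are assumptions made for illustration.

```python
import itertools
import random

def sample_constraints(labels, fraction, seed=0):
    """Pick a fraction of all node pairs and label them from ground truth."""
    pairs = list(itertools.combinations(range(len(labels)), 2))
    rng = random.Random(seed)
    chosen = rng.sample(pairs, round(fraction * len(pairs)))
    must_link = [(i, j) for i, j in chosen if labels[i] == labels[j]]
    cannot_link = [(i, j) for i, j in chosen if labels[i] != labels[j]]
    return must_link, cannot_link

# 4 communities of 32 nodes each, as in the example above.
labels = [c for c in range(4) for _ in range(32)]
ml, cl = sample_constraints(labels, fraction=0.05)
```

With 8128 pairs in total, 5 percent corresponds to 406 constrained pairs after rounding; most of them are cannot-link because most pairs span two communities.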

3. Experimental Results 

In this section, we empirically demonstrated the effectiveness of our pro- 
posed semi-supervised framework for the community structure detection prob- 
lem by applying NMF and spectral clustering with the de-noised consensus 
matrices to several well-studied networks. 

3.1. Data Description

We used both synthetic and real-world networks to test the effectiveness 
of our methods. The details of these datasets are as follows: 



1) GN (Girvan & Newman): Perhaps the most widely used benchmarks
are GN networks. The network has 128 nodes which are divided into
four non-overlapping communities with 32 nodes each. The degree of each
node is Zin + Zout = 16; in other words, each node has on average 16
edges, which randomly connect Zin nodes in its own community and
Zout nodes in other communities. As one can see, with increasing
Zout, the community structures become less clear and the problem
more challenging.



2) LFR (Lancichinetti, Fortunato & Radicchi, [12]): In most real
applications, the community structures are more complicated than in
GN networks. The network might be larger, the numbers of nodes in
different communities might not be identical, or different nodes might
have different positions, i.e., some are superstars or hubs and should
have higher degrees while the others are leaves. The LFR benchmark
networks were proposed to address these problems. In LFR networks, both
the degree and the community size distributions are power laws, with
exponents $\gamma$ and $\beta$, which is more practical. Each node has a
fraction $1 - \mu$ of its links with the nodes in its own community and
a fraction $\mu$ with the other ones. Here $\mu$ is called the mixing
parameter.

We set the parameters of the LFR network as follows: the number of nodes
was 1000, the average degree of the nodes was 20, the maximum degree
was 50, the exponent of the degree distribution $\gamma$ was 2 and that of
the community size distribution $\beta$ was 1, and the mixing parameter
$\mu$ ranged from 0.1 to 0.8. The communities were non-overlapping.

3) Karate ([13]): this dataset contains the network of friendships between
34 members of a karate club at an American university. This club was by 
chance split into two smaller ones due to the divergence of opinions about 
the club fees. 

4) Football (Q): this dataset contains the network of American football 
games (not soccer) between Division IA colleges during regular season Fall 
2000. There are 115 nodes representing the football teams while an edge 
means there was a game between the teams connected by the edge. The 
teams were divided into 12 conferences, and all teams except a few (mainly
in two conferences) played against the ones in the same conference more 
frequently than those in other conferences. 

3.2. Assessment Standards

In our experiments, the normalized mutual information (NMI, [3]) was
used as the standard to evaluate the community structure detection
performance. The value can be formulated as follows:

$$\mathrm{NMI}(M_1, M_2) = \frac{\sum_{i=1}^{k} \sum_{j=1}^{k} n_{ij} \log \frac{n \, n_{ij}}{n_i^{(1)} n_j^{(2)}}}{\sqrt{\left( \sum_{i=1}^{k} n_i^{(1)} \log \frac{n_i^{(1)}}{n} \right) \left( \sum_{j=1}^{k} n_j^{(2)} \log \frac{n_j^{(2)}}{n} \right)}},$$

where $M_1$ is the ground-truth cluster label and $M_2$ is the computed
cluster label, k is the community number, n is the number of nodes,
$n_{ij}$ is the number of nodes in the ground-truth cluster i that are
assigned to the computed cluster j, $n_i^{(1)}$ is the number of nodes in
the ground-truth cluster i, $n_j^{(2)}$ is the number of nodes in the
computed cluster j, and log is the natural logarithm.

Compared with simply counting the number of misclassified nodes, NMI is
more informative and especially suitable for imbalanced datasets (i.e., the
numbers of nodes in different communities are not identical). For example,
in a four-sample toy dataset, the ground-truth cluster labels could be
1, 1, 1, 2, and the computed cluster labels of two different models could be
1, 1, 1, 1 and 1, 1, 2, 2 respectively. In other words, the smaller cluster
was masked and not detected by the first model, hence the second model
should be better though it also had one sample mis-clustered. But the
accuracy (number of correctly classified nodes divided by the number of
nodes in the graph) of both models was 75%, which was misleading. On the
other hand, the NMI in this case was 0 (the numerator of NMI was
$3 \log \frac{3 \cdot 4}{3 \cdot 4} + 1 \cdot \log \frac{1 \cdot 4}{1 \cdot 4} = 0$)
and 34.56% respectively, which was relatively more reasonable and informative.
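
The toy example can be checked numerically with a short, standard-library-only sketch of NMI, written here as the mutual-information sum divided by the square root of the product of the two entropy-like sums. The function name `nmi` is hypothetical, and returning 0 when the denominator vanishes (one labelling puts every node in a single cluster) is an assumption chosen to match the value quoted in the text.

```python
import math
from collections import Counter

def nmi(truth, pred):
    """Normalized mutual information between two labelings (natural log)."""
    n = len(truth)
    joint = Counter(zip(truth, pred))   # n_ij: overlap of true cluster i and computed cluster j
    n1 = Counter(truth)                 # sizes of ground-truth clusters
    n2 = Counter(pred)                  # sizes of computed clusters
    num = sum(nij * math.log(n * nij / (n1[a] * n2[b]))
              for (a, b), nij in joint.items())
    den = math.sqrt(sum(c * math.log(c / n) for c in n1.values())
                    * sum(c * math.log(c / n) for c in n2.values()))
    return num / den if den > 0 else 0.0

truth = [1, 1, 1, 2]
first = nmi(truth, [1, 1, 1, 1])    # the small cluster is missed entirely
second = nmi(truth, [1, 1, 2, 2])   # one node misplaced, small cluster found
```

Here `first` evaluates to 0 and `second` to roughly 0.3456, in line with the 0 and 34.56% discussed above.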



In subsection 3.5, we also used the modularity function Q ([15, 16]) as
the standard to determine the best community number k. The function can
be defined as follows:

$$Q = \sum_{c=1}^{k} \left[ \frac{L(V_c, V_c)}{L(V, V)} - \left( \frac{L(V_c, V)}{L(V, V)} \right)^2 \right],$$

where $V_c$ is the cth community in the graph,
$L(V_1, V_2) = \sum_{i \in V_1, j \in V_2, i \ne j} a_{ij}$, and $a_{ij}$ is
the (i,j)th element of the adjacency matrix.

The larger the values of NMI and Q score, the better the graph partition- 
ing results. 
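
As an illustration, the modularity score can be computed as follows, assuming the commonly used form $Q = \sum_c [L(V_c, V_c)/L(V, V) - (L(V_c, V)/L(V, V))^2]$ with $L(V_1, V_2)$ the sum of adjacency entries between the two node sets, excluding i = j. The function name `modularity` and the use of NumPy are assumptions; this is a sketch, not the authors' implementation.

```python
import numpy as np

def modularity(A, labels):
    """Q = sum_c [ L(Vc,Vc)/L(V,V) - (L(Vc,V)/L(V,V))^2 ] for a hard partition."""
    A = np.array(A, dtype=float)
    np.fill_diagonal(A, 0.0)           # the sums in L(., .) skip i = j
    labels = np.asarray(labels)
    total = A.sum()                    # L(V, V)
    q = 0.0
    for c in np.unique(labels):
        members = labels == c
        within = A[np.ix_(members, members)].sum()   # L(V_c, V_c)
        attached = A[members, :].sum()               # L(V_c, V)
        q += within / total - (attached / total) ** 2
    return q

# Two disjoint triangles (hypothetical toy graph).
A = np.kron(np.eye(2), np.ones((3, 3))) - np.eye(6)
q_split = modularity(A, [0, 0, 0, 1, 1, 1])
q_single = modularity(A, [0, 0, 0, 0, 0, 0])
```

For this toy graph the natural split gives Q = 0.5, while putting all nodes in one community gives Q = 0, which matches the intuition that larger Q means a better partition.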

3.3. Performance of NMF based on Different Similarity Measures 

In this subsection, we compared the clustering performance of NMF-based
models with different similarity measures, including the adjacency-matrix
variants defined in Sect. 2 and SK.

Fig. 2 shows the NMI results on the GN networks and the LFR networks,
from which one can observe that: i) the averaged NMI results decrease with
increasing Zout for GN networks and increasing $\mu$ for LFR networks;
ii) There is no single winner, but is a competitive one. Note that getting 
the diffusion kernel is time consuming for large scale networks, hence we used 
A^ for the NMF-based models in the following experiments. 

3.4. Results Analysis

In this subsection, we systematically compared the NMI results obtained
by the models on the artificial datasets and the karate network with
prior knowledge available. For an undirected network with n nodes, there
are n(n - 1)/2 node pairs in total. We randomly picked some pairs of nodes
and determined whether they belonged to $C_{ML}$ or $C_{CL}$: if the two
nodes had the same community label, they were must-link; otherwise, they
were cannot-link. The results were averaged over ten trials and are given
in Figures 3, 4 and Table 1. From these figures and the table, one can
observe that: i) the trends of all the models are generally identical, and
the values of the averaged NMI increase with the increasing percentage of
pairs constrained; ii) for the synthetic datasets GN and LFR, the type II
framework always performs better, and spectral clustering is better than
the NMF-based models, especially for the LFR datasets; iii) for the karate
network, NMF with least squares error performs better. In summary, our
proposed semi-supervised framework did greatly enhance the results of
community structure detection by benefiting from the user-provided
background information.

Figure 2: Comparison of four similarity measures on datasets GN and LFR. "LSE" means
least squares error, "KL" means K-L divergence, "SNMF" means symmetric NMF.






[Figure 3, panels (a) Type I: $\alpha = 1$ and (b) Type II: $\alpha = 2$.
Horizontal axis: percentage of pairs constrained (1%, 2%, 3%, 4%, 5%, 10%,
20%, 30%, 40%, 50%). Curves: NMF_LSE, NMF_KL, SNMF, Spectral.]

Figure 3: Averaged NMI of different models for different percentages of node pairs
constrained on GN datasets. The black horizontal line is the best NMI result ever
obtained by NMF_LSE, NMF_KL, SNMF and spectral clustering with no prior knowledge
available. "LSE", "KL" and "SNMF" have the same meanings as in Fig. 2.



Table 1: Averaged NMI of different models given different percentages of node pairs
constrained under the type II on the karate dataset. "P" means percentage of node pairs
constrained. "LSE", "KL" and "SNMF" have the same meanings as in Fig. 2.

P                    1%      2%      3%      4%      5%      10%     20%
NMF_LSE              99.84%  98.86%  99.67%  99.84%  99.84%  100%    100%
NMF_KL               73.38%  73.44%  82.86%  85.18%  89.24%  89.14%  98.37%
SNMF                 59.53%  51.50%  54.06%  60.96%  53.74%  57.91%  56.57%
Spectral Clustering  90.19%  90.19%  95.10%  96.73%  95.10%  100%    100%



3.5. A Case Study: College Football Network 

In this subsection, we used the college football network as a case study
and examined the partitioning results of NMF_LSE and spectral clustering
given different percentages of pairs constrained. The teams were separated
into 12 conferences, and most of them played against the ones in the same
conference more frequently. However, teams 37, 43, 81, 83, 91 (in conference
IA Independents), 12, 25, 51, 60, 64, 70, 98 (in conference Sunbelt), 111, 29
and 59 played more frequently against the ones in other conferences. Table 2
lists the basic information about these teams, from which one can observe
that three out of the five teams in IA Independents never played against the
ones in the same conference and the other two teams played only once.

[Figure 4, panels (a) Type I: $\alpha = 1$ and (b) Type II: $\alpha = 2$.]

Figure 4: Averaged NMI of different models for different percentages of node pairs
constrained on LFR datasets. The black horizontal line, "LSE", "KL" and "SNMF"
have the same meanings as in Fig. 3.

First, we tried to determine the community number k. We compared
the values of the modularity function Q at different k, and the function
achieved its peak value at k = 11. By combining the Q values in Table 3
with the information in Table 2, we set the community number k = 11,
and the teams in IA Independents would be assigned to the other eleven
conferences based on the outputs of NMF and spectral clustering. Hence
there were 115 - 5 = 110 teams with ground-truth conference labels and
110 x (110 - 1)/2 = 5995 team pairs available in total. We randomly selected
some pairs as constraints: if the two teams of a pair were in the same
conference, they were must-link (ML); otherwise, they were cannot-link (CL).

Figure 5 gives the resulting partitions of NMF and spectral clustering
corresponding to different percentages of pairs constrained. Given no
prior knowledge, there were 5 to 6 abnormal teams mis-clustered:
teams 29, 60, 64, 98, 111 with NMF and teams 29, 59, 60, 64, 98, 111 with
spectral clustering. After randomly constraining 5 percent of the pairs,
the results were significantly improved and only two abnormal teams were
mis-clustered: teams 29 and 111. Finally, given 20 percent, there was
only one team mis-clustered: team 59. From these results, one can see that:
1) NMF and spectral clustering already perform well, in that only some
abnormal teams are not correctly clustered; 2) our semi-supervised clustering
framework does take the background information and domain knowledge into
consideration, which makes the partitioning results more explainable.



Table 2: Basic information about the abnormal teams that played more frequently against 
the ones in the other conferences. "T" means the team id, "F" means the times that the 
team played against the other ones in the same conference or in the other conferences, "S" 
means the same conference, "O" means the other conferences. 









Table 3: Values of averaged Q functions of NMF_LSE and spectral clustering. The range
of the community number k that we have tried is 8 to 12. The peak values were achieved
at k = 11. "LSE" has the same meaning as in Fig. 2.

Community Number k   8       9       10      11      12
NMF_LSE              0.5770  0.5831  0.5890  0.5934  0.5885
Spectral Clustering  0.5932  0.5927  0.5942  0.5978  0.5951



3.6. How to give the prior knowledge: randomly or based-on-rule 

Figure 5: Comparison of the results of NMF and spectral clustering corresponding to
different percentages of pairs constrained. (a): Real grouping in football dataset. There
are 12 conferences of 8-12 teams (nodes) each. (b), (c): Outputs of NMF and spectral
clustering respectively without any prior knowledge. (d): Outputs of NMF and spectral
clustering corresponding to 5 percent of pairs constrained. (e), (f): Outputs of NMF and
spectral clustering respectively, corresponding to 20 percent of pairs constrained.

Finally, we present an interesting problem: how to select the prior
information and incorporate it into the models? To the best of our knowledge,
the most widely used method in practice is to randomly select some pairs
of samples or nodes and manually determine whether they are must-link or
cannot-link based on domain knowledge. But are there better methods to
select the pairs that can reduce the workload, improve the clustering
performance, or both? Indeed, for a large-scale network, even a very
small percentage of pairs may still mean a huge workload. In this
subsection, we attempted to introduce a new rule-based method to address
this problem. First, we computed the Hamming distances between all pairs of
nodes (rows of the adjacency matrix) and sorted the distances to find the
largest and the smallest ones (this step can be done by a program, not
manually). Then we manually decided whether the selected pairs were
must-link or cannot-link and incorporated them into the clustering process.
The results on GN datasets are shown in Fig. 6, from which one can observe
that our preliminary results are not good enough compared with those in
subsection 3.4. Hence we leave the problem open and believe that it deserves
further study.
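
The automatic part of this rule-based selection, ranking node pairs by the Hamming distance between their rows of the adjacency matrix, might look as follows. This is only a sketch: the function names, the choice to return the m closest and the m farthest pairs, and the tie-breaking by stable sort are assumptions, and the final must-link or cannot-link decision is still left to the user as described above.

```python
import itertools
import numpy as np

def rank_pairs_by_hamming(A):
    """Return ((i, j), distance) for all node pairs, sorted by Hamming distance."""
    A = np.asarray(A)
    pairs = list(itertools.combinations(range(A.shape[0]), 2))
    dist = [int(np.sum(A[i] != A[j])) for i, j in pairs]   # rows that differ entrywise
    order = np.argsort(dist, kind="stable")
    return [(pairs[t], dist[t]) for t in order]

def candidate_pairs(A, m):
    """The m closest and the m farthest pairs, to be labelled by hand."""
    ranked = rank_pairs_by_hamming(A)
    return ranked[:m], ranked[-m:]

# Two disjoint triangles with the diagonal set to 1 (hypothetical toy graph).
A = np.kron(np.eye(2), np.ones((3, 3)), ).astype(int)
closest, farthest = candidate_pairs(A, 3)
```

In this toy graph, nodes in the same triangle have identical rows (distance 0), while nodes in different triangles differ in all six positions.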




[Figure 6. Horizontal axis: percentage of pairs constrained (1%, 2%, 3%, 4%,
5%, 10%, 20%, 30%, 40%, 50%).]

Figure 6: Averaged NMI of different models for different percentages of node pairs
constrained under the type II on GN datasets. The prior knowledge is given based on
the rule. The black horizontal line, "LSE", "KL" and "SNMF" have the same meanings
as in Fig. 3.

4. Conclusions and Future work 

In this paper, we have introduced a semi-supervised community structure 
detection framework for complex social network analysis. The framework 
adds the supervision of pairwise must-link and cannot-link constraints into 
the adjacency matrix, which can be regarded as de-noising of the consen- 
sus matrix of community structures. The experiments on both the synthetic 
and real-world networks have demonstrated the effectiveness of the proposed
framework. In summary, it can combine the network's functions (background 
information and domain knowledge) with its topology, making the commu- 
nity structure detection more effective and the results more practical. 

We would like to close this paper by raising two interesting problems.
First, as mentioned in subsection 3.6, are there any better methods
for selecting the constraints? A good attempt is the work in ref. [17],
which selected the constraints based on various similarity measures rather
than randomly. Second, the proposed framework is very flexible and can be
naturally combined with the models in refs. [18, 19, 20]. Research on these
models is our future work.